(54) Mesh interconnected array in a fault-tolerant computer system



(57) Bus interface units (BIUs) (54) perform fault
detection, identification, and reconfiguration for all
information transfers between redundant central processing
units (CPUs) (56) and memory or input/output (I/O)
(57A-C) in a mesh interconnected array of a highly
reliable fault-tolerant computer system. Errors are detected
by self-checking within the BIUs, signal parity checks by
the BIUs, cross channel comparisons, and mesh
transaction assessments. Fault identification and mesh
reconfiguration are performed such that no
faulty unit remains active in decision making after
reconfiguration, and the number of good units isolated during
reconfiguration is minimized.
Description

Field of the Invention 

The present invention generally relates to a fault- 
tolerant computer system and, more particularly, to 
improved fault identification and reconfiguration per- 
formed in a mesh array computer architecture. 

Background of the Invention 

Fault-tolerant computing is the art of building com- 
puting systems that continue to operate satisfactorily in 
the presence of faults (i.e. hardware or software fail- 
ures). For example, large commercial aircraft typically 
include complex flight control computer systems. To 
ensure the safety of the passengers in the event of a 
fault, the avionics systems of such aircraft can include 
three or more redundant computer systems. The sys- 
tems process the same data from a common source or 
redundant data from different sources monitoring the 
same parameter. Should a fault cause an error in the 
output of one of the redundant computer systems, the 
outputs of the other systems that agree are used by the 
avionics system. The step of selecting data for use in a 
process from among the outputs of redundant sources 
is called "voting," which is a form of "fault masking" 
because the erroneous component is ignored. 
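
The voting step described above can be illustrated with a short
sketch. This is only an illustration under stated assumptions, not the
circuitry of any particular avionics system: the function names `vote`
and `dissenters` are hypothetical, and a strict majority rule is
assumed.

```python
from collections import Counter


def vote(outputs):
    """Return the value reported by a strict majority of channels, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def dissenters(outputs):
    """Indices of channels whose output differs from the voted value."""
    voted = vote(outputs)
    if voted is None:
        return []  # no majority, so no channel can be singled out
    return [i for i, out in enumerate(outputs) if out != voted]
```

With three channels reporting 101, 101 and 99, the voted value is 101
and channel 2 is masked.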

Although noise or other transient occurrences can 
produce one-time disagreements that do not have long 
term adverse impact on the overall avionics system, 
continuing disagreements usually indicate a failed com- 
ponent or a break in communications on one of the 
channels. Accordingly, fault-tolerant systems can be 
configured to lock out one of the redundant channels if 
it continues to produce results that differ from the other 
channels. Reconfiguring a fault-tolerant system in this 
manner is a form of "dynamic recovery." In another 
dynamic recovery procedure faulty modules are identi- 
fied and switched out of the system and are substituted 
with a system spare. Thereafter, the system instigates 
roll back, initialization, retry, or restart actions neces- 
sary to continue ongoing computations. 
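
The lock-out form of dynamic recovery described above might be
sketched as follows. The class name `ChannelMonitor`, the consecutive
disagreement counter and the `limit` threshold are assumptions made for
illustration; the text does not specify how persistence is measured.

```python
class ChannelMonitor:
    """Locks out a channel after `limit` consecutive disagreements (assumed rule)."""

    def __init__(self, n_channels, limit=3):
        self.limit = limit
        self.misses = [0] * n_channels
        self.locked_out = [False] * n_channels

    def record(self, disagreeing):
        """Update counters after one vote; `disagreeing` holds channel indices."""
        for ch in range(len(self.misses)):
            if self.locked_out[ch]:
                continue  # a locked-out channel stays out
            if ch in disagreeing:
                self.misses[ch] += 1
                if self.misses[ch] >= self.limit:
                    self.locked_out[ch] = True
            else:
                self.misses[ch] = 0  # a transient disagreement is forgiven
```

A one-time disagreement resets once the channel agrees again, whereas a
continuing disagreement reaches the limit and removes the channel.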

One effective fault-tolerant "uni-processor architec- 
ture" (multiple processors acting as one processor) that 
uses fault masking and dynamic recovery is the mesh 
interconnected array. The mesh interconnected array is 
a matrix of nodes connecting identical central process- 
ing units (CPUs) to identical memory modules. This 
approach can be represented as multiple identical 
CPUs horizontally disposed with a CPU bus extending 
vertically below each CPU. Multiple identical memory 
modules are vertically disposed with memory buses 
extending horizontally to intersect the CPU buses at
nodes or bus interface units (BIUs). Existing mesh
architectures perform fault masking and dynamic recovery
by comparing pairs of BIUs. One BIU acts as a voting
master and another as a voting checker. The
checker is compared with the master and does not
actively pass data. This is more commonly known as
master-checker pairs and is an effective masking
technique in fault-tolerant systems. However, the
master-checker relationship relies on strict majority vote rules
and ignores the uniqueness of the mesh architecture
array. Also, these systems assume fault free voting
mechanisms and therefore fail to perform error analysis
within the BIU circuitry. Other examples of prior art
fault-tolerant architectures are shown in Budde et al. U.S.
patents 4,438,494; 4,503,534; and 4,503,535.

Summary of the Invention 

The present invention provides a reliable mesh
interconnected array in a fault-tolerant computer system.
The system includes multiple central processing
units (CPUs), multiple memory units, and bus interface
units (BIUs). The BIUs are located at the intersections
of multiple vertical CPU buses and multiple horizontal
memory buses. The result is a matrix of BIUs. Each BIU
contains a transaction controller for transmitting a
transaction including at least one of the following: address
data, read data, write data, or control data. A mesh bus
connects all BIUs within a single array of BIUs. The
BIUs comprise an error detector for detecting correctable,
retryable and non-retryable errors in the transmitted
data of a transaction. An error corrector in each BIU
corrects any correctable errors detected. A mesh error
reporter in each BIU asserts an error to all the BIUs via the mesh
bus if a retryable or non-retryable error was detected by
the error detector. A retry mechanism sends the data
back to the error detector a preset number of times if a
retryable error remains asserted on the mesh bus. A
fault manager isolates and eliminates BIUs with
detected errors remaining after operation of the retry
mechanism by reconfiguring the active status of each
BIU connected on the mesh bus according to the type of
remaining detected error(s). A continuation mechanism
completes the transaction if no error(s) remain
asserted on the mesh bus.

In accordance with other aspects of this invention,
the error detector further comprises a single thread read
back mechanism for detecting errors of data written to
memory when only a single BIU or a single CPU
channel is active.

In accordance with further aspects of this invention,
the fault manager within each BIU detects any
self-implicating errors, shuts down each BIU that detects an
uncorrectable self-implicating error, asserts a message
to the mesh bus indicating the functioning status (active
or inactive) of each BIU, and reconfigures the status of
the BIUs by performing a configuration consistency
algorithm.

In accordance with yet other aspects of this
invention, the fault management processor within each BIU
detects any synchronization errors, asserts a first error
message to the mesh bus if a synchronization error was
detected, performs a first reconfiguration algorithm if a
first error message was asserted on the mesh bus, and
performs the configuration consistency algorithm.

In accordance with still further aspects of this
invention, the fault management processor within each BIU
detects any memory bus errors, asserts a second error
message to the mesh bus if a memory bus error was
detected, performs a second reconfiguration algorithm if
a second error message was asserted on the mesh bus,
and performs the configuration consistency algorithm.

In accordance with yet still other aspects of this
invention, the fault management processor within each
BIU detects any CPU bus errors, asserts a third error
message to the mesh bus if a CPU bus error was
detected, performs a third reconfiguration algorithm if a
third error message was asserted on the mesh bus, and
performs the configuration consistency algorithm.

In accordance with other aspects of this invention,
the system includes, as a minimum architecture, a CPU, at
least one memory unit, and a BIU. The system of the
present invention preferably includes multiple CPUs,
multiple memory units, and multiple BIUs for greater
tolerance of faults. The invention provides for fault
identification and reconfiguration in a mesh array architecture
that allows for single channel operation.

Brief Description of the Drawings 

The foregoing aspects and many of the attendant 
advantages of this invention will become more readily 
appreciated as the same becomes better understood by 
reference to the following detailed description, when 
taken in conjunction with the accompanying drawings, 
wherein: 

FIGURE 1 is a block diagram of a mesh interconnected
array in a fault-tolerant computer system in
accordance with the present invention using a 2 by
3 matrix of bus interface units (BIUs);

FIGURE 2 is a positional assignment chart of the
diagram of FIGURE 1;

FIGURE 3 is a block diagram of a mesh interconnected
array in a fault-tolerant computer system in
accordance with the present invention using two 2
by 3 matrices of BIUs performing off the same
CPUs;

FIGURE 4 is a block diagram of the internal
components of the BIUs of the configuration of FIGURE 1
or FIGURE 3;

FIGURES 5-13 are flow diagrams illustrating steps
performed to achieve mesh reconfiguration in
accordance with the present invention;

FIGURES 14-19 are charts illustrating algorithms
that perform the reconfiguration;

FIGURE 20 is a diagram of the configuration
consistency check circuitry; and

FIGURES 21 and 22 are charts illustrating
mastership assignment performed in the circuitry of
FIGURE 20.



Detailed Description of the Preferred Embodiment 

FIGURE 1 is a block diagram of a plurality of bus
interface units (BIUs) 54 at nodes of a fault tolerant
matrix (FTM) array in accordance with a preferred
embodiment of the present invention. Multiple redundant
central processing units (CPUs) 56 have associated
CPU buses 50 intersecting memory buses 51. The
memory buses extend from redundant memory units
57. In the architecture of FIGURE 1, a 2 by 3 matrix
consisting of two rows of memory units 57 and three
columns of CPUs 56 is used. Each BIU 54 provides a
connection or node between a CPU bus 50 and a
memory bus 51, i.e., a connection is provided for each
column CPU 56 to each row memory unit 57. Each BIU is
identified first by the row number of the connected
memory unit 57 and then by the column number of the
connected CPU 56. All the BIUs 54 in a single FTM are
interconnected by a mesh bus 52. As shown in FIGURE
2, the location of each BIU 54 has an associated vector
bit position which is part of a configuration vector in
reconfiguration processing, described in more detail
below with reference to FIGURES 19 and 20.

The present invention is capable of operating in various
configurations, such as a single CPU 56 connected
through a single BIU 54 to a single memory unit 57. The
present invention also operates effectively if a single
CPU is used with multiple redundant memory units, or a
single memory channel is used with multiple redundant
CPUs. A single CPU channel is identified by a single
CPU 56 and CPU bus 50 connected to at least one
memory bus 51 with a corresponding memory unit 57. A
single memory channel has a single memory bus 51
and memory unit 57 connected to at least one CPU 56.
The system is also effective when disjoint BIUs 54 are
present. BIUs 54 are disjoint when two or more BIUs 54
fail to occupy the same CPU bus 50 and memory bus
51. The present invention is capable of handling matrix
arrays larger than that shown in FIGURE 1. However, for
the purpose of this description, a 2 by 3 matrix array is
effective for showing the redundant capabilities of the
present invention. It is also noted that larger matrix
arrays provide diminishing improvements in fault
tolerant capabilities.
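
The notion of disjoint BIUs can be made concrete with a small check.
The sketch below assumes one reading of the definition above, namely
that BIUs are disjoint when no two of them share a memory bus (row) or
a CPU bus (column). The `(row, column)` tuples follow the
identification scheme described for FIGURE 1; the function name
`are_disjoint` is hypothetical.

```python
from itertools import combinations


def are_disjoint(bius):
    """True if no two BIUs (row, col) share a memory bus (row) or CPU bus (col)."""
    return all(a[0] != b[0] and a[1] != b[1] for a, b in combinations(bius, 2))
```

Under this reading, the BIUs at (0, 0) and (1, 2) are disjoint, while
those at (0, 0) and (0, 1) are not, because they sit on the same memory
bus.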

FIGURE 3 is a block diagram illustration of two FTM 
arrays operating from the same set of CPUs 56, but 
connected to distinct memory systems. A first 2 by 3 
mesh, mesh A, has two memory units 57 each consist- 
ing of an electrically erasable programmable read-only 
memory (EEPROM) 57a, a single port random access 
memory (SRAM) 57b and a dual port random access 
memory (DPRAM) 57c. The memory rows of a second 
2 by 3 mesh B comprise only a SRAM 57d and DPRAM 
57e. Mesh B performs distinct yet synchronous opera- 
tions from the same CPUs 56 that are coupled by mesh 
A. The CPUs 56 can operate a greater variety of func- 
tions with two FTMs connected. However, synchroniza- 
tion between the FTMs is a concern during operation. It 
can be appreciated by one of ordinary skill in the art that 
the mesh configuration of the present invention can integrate
with various types of memory and input/output
(I/O). System data requirements dictate the memory 
and I/O requirements. One I/O system the present 
invention can integrate with is an aircraft system inte- 
grated modular avionics box, such as an ARINC 629 
bus attached to the DPRAMs. 

The purpose of the present invention is to effectively
deliver or retrieve data from memory in the presence
of error(s). During a no error situation, all the BIUs
have assigned roles. Each row and column has a BIU
assigned a master role and the rest of the BIU(s) of that
row or column are assigned checker roles. In a memory
write, the master BIU is the only BIU allowed to write
data to memory in each memory channel. Each CPU
channel has a master BIU for data transactions between
the BIUs 54 and the CPUs 56. The checker BIUs are
used to check the master BIUs for correctness. Only
active BIUs participate in the checking or voting
performed in the algorithms of the present invention. When
the system is fully functioning, two important
interrelated tasks are performed. The first task detects errors
internal and external to every BIU in an FTM. The other
task is that of reconfiguration of the BIUs' status
according to the error detection task results. Reconfiguration is
performed by a series of algorithms, described below
with reference to FIGURES 14-21.

FIGURE 4 is a block diagram of the internal
components of a bus interface unit 54 with corresponding bus
connections. The bus interface unit 54 includes an
address datapath 58, a data datapath 59 and a
transaction controller 63 connected therebetween. An
initialization controller 64 also is connected between the
address datapath 58 and data datapath 59. A fault
identification and reconfiguration (FDIR) datapath 62
interacts with the transaction controller 63 and the
initialization controller 64, and an input/output (I/O)
controller 61 interacts with all other components of the BIU.
Each BIU 54 includes a dual state machine. As shown
in FIGURE 4, each controller and circuit path 58, 59,
61-64 has dual identical circuitry as illustrated by the
shadowed black boxes. The dual state machine includes the
dual identical circuits. The dual state machine provides
comparison checking of many of the processes
performed within the bus interface unit 54.

More specifically, as shown in FIGURE 4, the
address datapath 58 connects externally to CPU bus 50
(shown at the left) and memory bus 51 (shown at the
right). Internally, the address datapath 58 is coupled to
transaction controller 63, initialization controller 64, and
the data datapath 59. The address datapath 58 receives
address data from the CPU bus 50 via a CPU address
bus 66 and communicates with the memory bus 51 via
a memory address bus 67. The data datapath 59 is
internally coupled to transaction controller 63,
initialization controller 64 and address datapath 58 (represented
by the bold line 70 at the center). The data datapath 59
externally communicates with CPU bus 50 via a CPU
data bus 68 and memory bus 51 via a memory data bus
69. Transaction controller 63 receives control data from
CPU bus 50 via a CPU control bus 72 and is internally
connected to all of the BIU's internal components
except the initialization controller 64. Initialization
controller 64 has no direct external connections and is
internally connected to all of the BIU's components except
the transaction controller 63. I/O controller 61
communicates memory control data to and from memory bus 51
via a memory control bus 71 and communicates with
the other BIUs in the FTM by the mesh bus 52.
Internally, the I/O controller 61 is also connected to the FDIR
datapath 62.

Transaction controller 63 determines from CPU
control data the transaction that the BIU 54 is to perform
and commands each of the components to perform
according to the determined transaction. Some
examples of transactions are data reads from memory along
bus 69, data writes to memory along bus 69, and
address accesses of memory along bus 67. The
address datapath 58 processes address transactions
and the data datapath 59 processes all read and write
data transactions. Transaction controller 63 also
controls the synchronization of all the components within
the BIU 54, thus ensuring efficient processing of all
information passing in and through the BIU 54.

Initialization controller 64 provides control of the
BIU 54 during startup, including any instruction
fetches required during startup.

The FDIR datapath of each BIU 54 within an FTM
stores a word (configuration vector) that corresponds to
a status value of each BIU within the FTM. The precise
storage location is described below with reference to
FIGURE 20. In an FTM with 6 BIUs the configuration
vector is a six-bit word, each bit representing the status of a
BIU within the FTM. Referring to FIGURE 2, each BIU
54 has a position in the FTM. The position has a
corresponding bit in the configuration vector. For example, if
the configuration vector was 011101, the BIUs in vector
bit position 0, bottom left of FIGURE 1, and vector bit
position 4, top center of FIGURE 1, are considered
non-operative because of the zeroes in those respective
positions. The considered non-operative BIUs are
restricted from FTM operations provided all the BIUs 54
agree with this configuration vector, see below with
reference to FIGURES 19-22. The FDIR datapath 62
reconfigures the configuration vector upon identification
of faults from the other components within the BIU 54,
or according to a change in status of other BIUs
received from the mesh bus 52. The reconfiguration
algorithm is described below with reference to
FIGURES 19 and 20.
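
The configuration vector example above can be checked with a minimal
sketch. It assumes the vector is held as a string of '0' and '1'
characters and that vector bit position 0 is the leftmost character,
which is the ordering implied by the 011101 example; the helper names
are hypothetical.

```python
def inactive_positions(vector):
    """Vector bit positions marked non-operative (0); position 0 is leftmost."""
    return [pos for pos, bit in enumerate(vector) if bit == "0"]


def mark_inactive(vector, pos):
    """Return a copy of the vector with the BIU at `pos` marked non-operative."""
    return vector[:pos] + "0" + vector[pos + 1:]
```

For the vector 011101, `inactive_positions` yields positions 0 and 4,
matching the BIUs at the bottom left and top center of FIGURE 1.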

Finally, I/O controller 61 provides an interfacing
controlling unit with memory bus 51, mesh bus 52 and
the internal components connected to the I/O controller
61.

EP0 811 916 A2

Each BIU 54 in the FTM continuously detects errors
of data the BIU 54 receives, processes, and transmits.
Two types of detected errors are value errors and
synchronization errors. A value error is an error in the data
received, processed or transmitted by the BIU 54. A
synchronization error arises from inconsistencies in
synchronization between the components within a BIU
54, inconsistencies in synchronization between the
BIUs within an FTM, and synchronization inconsistencies
with a second FTM as shown in FIGURE 3. It can
be appreciated by one of ordinary skill that any
additional units interacting with an FTM must be checked for
synchronicity. Performance of dual state machine
comparisons, parity checks, complementary signal checks,
transaction consistency comparisons, wraparound
comparisons of signals sent off the BIU, clock phase
comparisons and mesh votes (described below with
reference to FIGURES 19 and 20) provides continuous
analysis for errors that may occur anywhere in a
transaction. Also, the analysis for errors allows each BIU to
perform self-implicating error evaluations, thus
providing an FTM that can operate with only one active BIU.

The BIUs 54 are preprogrammed to characterize
value and synchronization errors in two ways. The
errors are first characterized as a BIU self-implicating
error, a synchronization consistency error, a memory
bus error or a CPU bus error. This identification is later
used by the fault management procedure for further
isolating the errors (see FIGURE 11). Also the errors are
evaluated as either correctable, retryable or
non-retryable. The second evaluation greatly affects how the fault
management procedure reacts to the discovered error
(described in more detail below with reference to
FIGURES 5 and 20).
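
The two-way characterization above lends itself to a simple data
structure. The following is a sketch only: the type names `Source`,
`Severity` and `DetectedError` are hypothetical, and the rule in
`must_assert_on_mesh` restates the behavior described in the Summary
(retryable and non-retryable errors are asserted on the mesh bus, while
correctable errors are corrected locally).

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    SELF_IMPLICATING = "BIU self-implicating"
    SYNC_CONSISTENCY = "synchronization consistency"
    MEMORY_BUS = "memory bus"
    CPU_BUS = "CPU bus"


class Severity(Enum):
    CORRECTABLE = 1
    RETRYABLE = 2
    NON_RETRYABLE = 3


@dataclass(frozen=True)
class DetectedError:
    source: Source      # first characterization: where the error points
    severity: Severity  # second characterization: how it can be handled

    def must_assert_on_mesh(self):
        """Only retryable and non-retryable errors are reported on the mesh bus."""
        return self.severity is not Severity.CORRECTABLE
```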

Transaction Data Flow 

FIGURES 5-13 illustrate the fault-tolerant process
of a preferred embodiment of the present invention. All
CPU memory transactions are processed through the
BIUs 54 and are variations of reads from and writes to
memory. As shown in FIGURE 5, each transaction
received by a BIU 54 is analyzed for errors during phase
operation, specifically an address phase and a data
phase (110). FIGURES 6 and 8, described below,
illustrate the error determining steps performed in the
address and the data phases. If the error analysis
determines that a retryable or non-retryable error is present
in either of the phases, a message indicating so is
asserted to the other BIUs via the mesh bus 52 (111)
(non-retryable errors are described below with
reference to FIGURE 11). However, if the error analysis
determines that the error present was correctable, the
data is corrected (112). If no error was discovered
during the phase analysis at 110, or if only correctable
errors are present and are corrected at 112, the system
proceeds to a decision 115. If at the decision 115 it is
determined that no error is asserted on the mesh bus
52, the system proceeds to processing of the next
transaction. If an error is asserted on the mesh bus 52, the
system proceeds to the fault management procedure
116.
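
The flow of FIGURE 5 as described above can be sketched as a single
function. This is a simplified model, not the BIU hardware: errors are
represented by plain strings, the function name `process_transaction`
is hypothetical, and `mesh_error_asserted` stands for an error asserted
by any other BIU. The reference numerals in the returned strings are
those used in the text.

```python
def process_transaction(local_errors, mesh_error_asserted):
    """Model one pass through steps 110-116 for a single BIU.

    local_errors: iterable of 'correctable', 'retryable' or 'non-retryable'.
    mesh_error_asserted: True if another BIU already asserted an error.
    """
    actions = []
    for err in local_errors:
        if err == "correctable":
            actions.append("correct data (112)")
        else:
            actions.append("assert error on mesh bus (111)")
            mesh_error_asserted = True
    # Decision 115: is any error asserted on the mesh bus?
    if mesh_error_asserted:
        actions.append("fault management procedure (116)")
    else:
        actions.append("next transaction")
    return actions
```

Note that a BIU with no local error still enters fault management when
another BIU has asserted an error on the mesh bus.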

FIGURE 6 illustrates the processing steps
performed within the address phase of operation. Initially, a
case selection 144 is performed on the data received in
the address phase. Case selection 144 determines
whether the data received is from a new transaction or
a retry, branch 144a, or from a readback or a memory
scrub, branch 144b. An address phase retry is
performed in a retry procedure described below with
reference to FIGURE 9, and the readback case is described
below with reference to FIGURE 13. The memory scrub
case performs correction of incorrectly stored
information determined correctable during a read transaction.

If the case is a new transaction or retry, branch 
144a, the CPU drives the address at 140 to the BIU 
which latches the address at 141. Then the address is 
decoded and a cache RAM hit/miss determination is 
performed at 142. In this embodiment cache in the 
address datapath 58 provides faster access to instruc- 
tions/data stored in EEPROM. The address phase then 
performs a synchronization procedure at 145 and a nul- 
lification procedure at 147 (both procedures are illus- 
trated in FIGURE 7 and described below). Prior to the 
nullification procedure, the BIU drives the address to 
memory and memory controls are asserted at 146 onto 
the memory control bus 71. The address phase analysis
of the new transaction or retry, branch 144a, is complete 
upon completion of the nullification procedure, thus 
returning any errors determined in the address phase 
procedure, synchronization procedure or nullification 
procedure. 

If the address phase procedure selects the case of
a readback or memory scrub, branch 144b, the address 
phase procedure performs all of the same tasks as the 
retry case except the BIU uses the previously held 
address (143) and the procedure does not perform the 
nullification procedure. 

FIGURE 7 illustrates the synchronization procedure 
at 145 of FIGURE 6 and the nullification procedure at 
147 that operate within the address phase. The syn- 
chronization procedure performs three simultaneous 
functions shown as paths 169a, b and c. One function, 
169a, determines if a synchronization error occurs in 
the address phase of operation. The other two simulta- 
neous functions, paths 169b and c, drive a FTM 
address flag if the BIU thinks it is being addressed by 
the CPU and drives a mesh consistency signal if the 
BIU believes its address phase operation is consistent 
with the memory in memory units 57. The driven 
address flag is qualified with any driven error signals at 
175, then passed to the algorithm beginning at 176a 
and 180a, described below with reference to FIGURE 
16A. The driven consistency signal is similarly qualified 
with the error signals at 177 and processed through an 
algorithm at 176b and 180b, described below with refer- 
ence to FIGURE 17A. In 181, resulting actions are per- 
formed on the analysis of data from the beginning of the 
synchronization process. If no flag signal exists, no 
transaction will be prosecuted. If an address and a 
transaction consistency signal exist, then the synchroni- 
zation process indicates that the transaction is good. 
However, if a flag is present and no transaction consistency
signal is present, the transaction is illegal and will
generate a synchronous interrupt signal for further use 
in the fault management procedure. 
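
The three outcomes of the synchronization procedure at 181 amount to a
small decision table, sketched below. The two boolean inputs are
assumed to be the address flag and the transaction consistency signal
after qualification and voting; the function name and the returned
strings are hypothetical.

```python
def synchronization_outcome(address_flag, consistency_signal):
    """Map the voted flag and consistency signal to the resulting action."""
    if not address_flag:
        # No BIU is being addressed, so there is nothing to process.
        return "no transaction"
    if consistency_signal:
        return "transaction good"
    # Addressed, but the address phase operation was not consistent.
    return "illegal transaction: raise synchronous interrupt"
```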

The nullification procedure is illustrated with the
synchronization procedure, since it uses the same
driven error output of 170. The CPU 56 pipelines
transactions through BIUs 54 for efficient processing.
However, situations may occur when a transaction in the
pipeline is not necessary and must be nullified. The
nullification procedure provides the FTM with the
capability of nullifying an unwanted transaction. If a
nullification signal is received by a BIU, an output signal is
sent to the mesh bus 52 to nullify the transaction at 190.
The nullification signal sent to the mesh bus 52 is then
qualified by the error signal output of 170, and the
nullification algorithm of 185 and 186 is executed.
If the consensus of the algorithm is a nullify decision, a
non-retryable signal is produced and any disagreeing
BIU is shut down at 182 (non-retryable errors are
described below with reference to FIGURE 11). The
nullification algorithm is described in more detail with 
reference to FIGURE 18. Completion of both the nullifi- 
cation and synchronization procedures returns the 
processing to the address phase. 

The other procedure performed in the error
detection test at 110 of FIGURE 5 is the data phase
procedure, shown in FIGURE 8. The data phase procedure
includes case selection at 153 for choosing one of five
transactions shown as paths 167a-e according to the
data received by the data datapath 59: a read
transaction 167a; a write transaction 167b; an interrupt
transaction 167c; a scrub transaction 167d; and a readback
transaction 167e. If the case is a read transaction 167a,
a data read from memory, the BIU latches the data read
from memory at 150, checks it for errors in an error
detecting and correcting circuit (EDAC) located within
datapath 59 and generates parity at 152, and drives the
data onto the CPU bus 50 at 156. If the case determined
is a write transaction 167b, a data write to memory, the
BIU latches the data from the CPU to be written to
memory at 151, generates error correcting code (ECC)
at 155 and drives the data received from the CPU onto
the memory bus at 157. The write transaction includes
a step B that is described in more detail below with
reference to FIGURE 8B. If the transaction is an interrupt 167c, interrupt
parameters are written to an interrupt register at 158.
The interrupt transaction initiates from errors requiring
the FTM and/or BIU to temporarily stop processing. If
the case is a scrub transaction 167d, error correcting
code (ECC) is generated at 162 and the BIU drives the
data to be stored onto the memory bus and asserts
memory controls at 161. Finally, if the case is a
readback transaction 167e, the memory controls are
asserted at 163, the BIU latches the data sent to memory
at 166, checks the stored data by EDAC and compares
that data to the received data from the CPU at 165. The
scrub transaction performs corrections of correctable
data. Completion of each of the cases within the data
phase procedure returns the procedure back to the
decision at 110 of FIGURE 5 with any errors
determined by the data phase procedure.

As shown in FIGURE 8B, a check is performed on
the memory stored data of a write transaction during
either single CPU channel or single memory channel
writing to DPRAM at 127. If the data written to memory
is correct, the process returns to phase processing 110
of FIGURE 5. If the data is incorrect but correctable
(131), the data is corrected (132). However, if the data is
incorrect but not correctable, an error is asserted on the
mesh bus.

If retryable or non-retryable error(s) exist after
performance of the phase operations, the fault
management procedure 116 is initiated, as shown in FIGURES
9-12. At decision 198, the process moves directly to a
fault identification and reconfiguration (FDIR) procedure
208 if a non-retryable error was previously discovered.
As shown in FIGURE 9, the system separately retries
the address and data phase procedures at 197-205. At
step 200, address phase procedure errors are latched
and saved until completion of the data phase procedure.
This allows the next transaction to start the address
phase procedure when the present transaction starts
the data phase procedure. A preset number of retries
are performed until no errors are asserted on the mesh
bus 52, as determined by a decision at 205, or the retry
limit is reached, as determined by decision 207. If the
retry purges the retryable error(s), the system proceeds
to processing the next transaction. If a fault is still
asserted on the mesh bus 52 after reaching the retry
limit, the system proceeds to the FDIR procedure 208.
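
The retry logic of FIGURE 9 can be summarized in a short sketch. It
assumes a hypothetical callable `run_phases` that re-runs the address
and data phase procedures and returns True while an error remains
asserted on the mesh bus; the separate latching of address phase errors
at step 200 is not modeled.

```python
def fault_management(run_phases, retry_limit, non_retryable=False):
    """Return the next step after the fault management procedure 116."""
    if non_retryable:
        # Decision 198: non-retryable errors skip the retries entirely.
        return "FDIR procedure (208)"
    for _ in range(retry_limit):
        if not run_phases():
            # Decision 205: the retry purged the error.
            return "next transaction"
    # Decision 207: retry limit reached with a fault still asserted.
    return "FDIR procedure (208)"
```

A transient fault that clears on the second attempt returns to normal
processing, whereas a persistent fault exhausts the retry limit and is
handed to the FDIR procedure.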

Referring to FIGURE 10, the FDIR procedure 208
first checks for address phase errors at 210-212, then
data phase errors at 215-217. If an error is detected in
either of the phases at 212 and 217, the fault
identification (FID) procedure 218 is separately initiated.
Essentially, at this stage of the procedure any errors present
are categorized into one of the two phases, address or
data. The FID procedure 218 further isolates any errors
within each phase and reconfigures the mesh
configuration vector according to any detected errors.

FIGURE 11 illustrates the FID procedure 218 that
further isolates any errors and reconfigures the
configuration vector according to the location of the error(s).
Each BIU maintains the n bit configuration vector that
indicates the BIU perception of the status of each of the
n BIUs in the FTM. For the purpose of the present
invention n=6 because there are six BIUs in the 2 by 3
matrix array. The first analysis determines if the error(s)
are characterized as self-implicating at 220. If one or
more self-implicating errors exist, an error message is
asserted on the mesh bus 52 from a decision at 221. All
BIUs continuously send a phased heartbeat signal to
the BIUs via the mesh bus 52. During fault-free
operation the heartbeat signal is continually cycling at a
particular cycling phase known to all the BIUs. If the
heartbeat signal of a BIU is received out of phase, the
BIU is killed or removed from voting. A non-retryable
error is an error that causes an erroneous change in the
BIU's heartbeat signal. The self-implicated BIU shuts
down at 223 and sends a change of its phased
heartbeat signal to all the BIUs in the FTM. A phase changed
heartbeat signal indicates that the configuration vector
must undergo reconfiguration. The
individual BIU reaction to a changed heartbeat signal is
described below with reference to FIGURE 20. A
configuration consistency procedure 237 performs any
necessary reconfiguration of the configuration vector. The
configuration consistency procedure performs a
configuration voting algorithm (CVA) and reassigns the master
BIUs accordingly, described below with reference to
FIGURES 19-22. Once complete, the system returns
to the FDIR procedure 208 (FIGURE 10) to continue
checking for any other errors of the same phase
(address or data).

The three other error types are isolated through
separate processing passes in the FID procedure. Synchronization
consistency errors, memory bus errors,
and CPU bus errors are isolated and reconfigured
according to predefined algorithms. The synchronization
consistency algorithms are described below with reference
to FIGURES 18A and 19A. The reconfiguration
algorithm for memory bus errors is described below with
reference to FIGURE 14A. The CPU bus error algorithm
is described below with reference to FIGURE 15A. After
an error is purged by the corresponding algorithm, the
configuration consistency procedure (FIGURE 12) performs
any necessary reconfiguration, followed by a
return to the FDIR procedure for further error processing.

As shown in FIGURE 12, each BIU receives, at
240, the configuration vector from all the other BIUs.
The CVA then performs a comparison of each bit within
the configuration vectors at 241, 242 and 245. First, the
BIUs in the memory channel vote on a bit in the configuration
vector at 241 (see step 1 of FIGURE 19A below).
Then the results of the memory channels'
votes are voted on for each bit in the configuration vector
at 242, and any BIU that voted dead a BIU determined
to be active is eliminated at 245 (see step 2 of
FIGURE 19A below). Upon completion of the
CVA, each BIU compiles a new consensus configuration
vector at 246 and compares it to a prestored reference
configuration vector at 247 (see FIGURES 19 and 20
below). If no match is found,
the mesh bus is asserted with an error and the configuration
consistency procedure begins again. If the vectors
match, the FID procedure is complete. The
consensus vector is saved as the new reference vector at
252, and a heartbeat error is indicated if a BIU is faulty at
253. Mastership is assigned at 254 and 255 and the
process returns to the FDIR procedure. The steps
and circuitry of the configuration consistency procedure
are described in more detail below with reference to
FIGURES 19-22.

As noted, the address phase and data phase procedures
are performed prior to 110 of FIGURE 5 and
during the fault management procedure (at 197, 201,
210 and 215 of FIGURES 9 and 10).

Reconfiguration Algorithms 

FIGURE 14A illustrates the two-step memory bus
error algorithm performed at 230 of FIGURE 11. In step
1, the memory channel master BIU rechecks for any
self-implicating errors, column A. Elimination of the
master, row 2, column B, occurs if the master self-implicates.
If the master does not self-implicate, step 2 is
initiated, row 2, column A. Step 2 is a majority agree or
disagree vote by the BIUs on the data the respective channel
master placed on the memory bus 51. If the majority of
the BIUs agree with the memory bus data, row 1,
column A, the transaction proceeds, row 1, column B, and
any disagreeing BIU is eliminated, row 2, column C. If
the majority of the BIUs disagree with the master's
memory bus data, row 2, column A, the transaction is
retried, row 2, column B, and the minority BIU is eliminated,
row 1, column C. In the case of a tie, row 3, column
C, the entire memory channel is eliminated, row 3,
column B. FIGURE 14B enumerates some configurations
and actions. In one example, column B of FIGURE
14B, the memory channel master does not
self-implicate. The master and one other BIU in the
same memory channel agree, rows 1 and 2, and the last
BIU in the same memory channel disagrees, row 3, with
the data the master placed on the memory bus 51. The
result is that the transaction proceeds and the disagreeing
BIU is eliminated, row 4.
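
The two steps described for FIGURE 14A can be sketched in Python as follows. This is only an illustrative model, not the circuitry of the invention: the function name `memory_bus_error_vote`, the BIU labels, and the action strings are hypothetical, and the vote is assumed to be taken over the active BIUs of one memory channel, master included.

```python
from typing import Dict, List, Tuple


def memory_bus_error_vote(
    master: str,
    master_self_implicates: bool,
    agrees: Dict[str, bool],
) -> Tuple[str, List[str]]:
    """Return (action, eliminated BIUs) for one memory channel.

    `agrees` maps each active BIU in the channel (master included) to
    whether it agrees with the data the master placed on the memory bus.
    All names here are hypothetical labels chosen for this sketch.
    """
    # Step 1: a self-implicating master is eliminated outright.
    if master_self_implicates:
        return "eliminate_master", [master]

    # Step 2: majority agree/disagree vote among the active BIUs.
    agreeing = sorted(b for b, ok in agrees.items() if ok)
    disagreeing = sorted(b for b, ok in agrees.items() if not ok)
    if len(agreeing) > len(disagreeing):
        return "proceed", disagreeing   # eliminate any disagreeing BIU
    if len(disagreeing) > len(agreeing):
        return "retry", agreeing        # eliminate the minority
    return "eliminate_channel", sorted(agrees)  # tie: drop the whole channel


# The column B example of FIGURE 14B as described in the text:
# master and one other BIU agree, the last BIU disagrees.
print(memory_bus_error_vote(
    "biu0", False, {"biu0": True, "biu1": True, "biu2": False}))
# -> ('proceed', ['biu2'])
```

The same function applied to a two-BIU channel in which the only other BIU disagrees returns the tie outcome, which matches the channel elimination described for the CPU bus example of FIGURE 15B.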

FIGURE 15A illustrates the two-step CPU bus error
algorithm performed at 235 of FIGURE 11. Step 1 is
similar to step 1 of the memory bus error algorithm of
FIGURE 14A, except that the CPU channel's master BIU is
the master BIU of interest. Step 2 is a strict majority vote
on the data the master CPU channel BIU places on the
CPU bus 50, and is similar to step 2 described for FIGURE
14A above.

FIGURE 15B enumerates some configurations and
corresponding actions. Column E is described below as
an example. First, the CPU channel master BIU determines
that no self-implication exists. The CPU channel master
BIU agrees with what it placed on the CPU bus 50,
row 1. Only one other BIU is active in the same CPU
channel and this BIU disagrees, row 2, with the data the
master BIU placed on the CPU bus 50. With no majority
result, this CPU channel is eliminated, column 14, and
no further transaction processing is performed by this
CPU channel.

As shown in FIGURE 16A, faults indicated by errors
relating to FTM addressing 176a, 180a of FIGURE 7 are
resolved by a two-step majority vote of the BIUs 54 in
the FTM. In step 1, the BIUs 54 on each of the CPU
channels vote on whether they think the CPU 56 is addressing the
FTM. In step 2, the results from all of the CPU channel
votes are voted. Ties in the first step are neutral to the
outcome of the second step. Specific error information
relating to the FTM address flag and the mesh consistency
is placed on the mesh bus 52 to nullify the validity
of the flag and mesh consistency of an otherwise active
BIU. The results of most cases agree with that of a simple
majority. As shown in FIGURE 16B, the fault syndrome
columns A, B, C represent the CPU columns and
each of the two responses in a column represents a voting
BIU on a separate memory channel. In the example
of row 4, the addressing votes of the BIUs in the column A
and column B CPU channels are each tied, and the two BIUs
in the right CPU channel, column C, cast a consensus yes
address vote. The resulting majority CPU channel vote
is yes, column D. The FTM is being addressed, the
transaction is initiated and the BIUs that voted not
addressed, the top left and bottom center BIUs, are eliminated,
columns E, F. The "Comments" columns for FIGURES
16-19 are reserved for comments regarding the
corresponding row results.
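
As a rough illustration of the two-step vote of FIGURE 16A, the Python sketch below votes within each CPU channel and then votes the channel results, treating tied channels as neutral. It models only the simple-majority cases. The "minority" exceptions in the comment columns of the figures, and the handling of an overall tie, are not modeled. The names `majority` and `addressing_vote` and the (channel, memory channel) index pairs are assumptions made for this example.

```python
from typing import List, Optional, Sequence, Tuple

# True = "addressed", False = "not addressed", None = no vote (inactive BIU)
Vote = Optional[bool]


def majority(votes: Sequence[Vote]) -> Vote:
    """Strict majority of the cast votes; None signals a tie or no voters."""
    yes = sum(v is True for v in votes)
    no = sum(v is False for v in votes)
    if yes == no:
        return None
    return yes > no


def addressing_vote(
    channels: List[Tuple[Vote, Vote]],
) -> Tuple[Vote, List[Tuple[int, int]]]:
    """Two-step vote over CPU channels, each holding two BIU votes."""
    # Step 1: vote within each CPU channel.
    channel_votes = [majority(pair) for pair in channels]
    # Step 2: vote the channel results; tied channels (None) are neutral.
    result = majority(channel_votes)
    # BIUs that voted against a decided result are eliminated.
    eliminated: List[Tuple[int, int]] = []
    if result is not None:
        eliminated = [
            (c, m)
            for c, pair in enumerate(channels)
            for m, v in enumerate(pair)
            if v is not None and v != result
        ]
    return result, eliminated


# Row 4 of FIGURE 16B as described: columns A and B tie, column C votes yes.
# Index 0 in each pair is taken to be the top BIU, index 1 the bottom BIU.
print(addressing_vote([(False, True), (True, False), (True, True)]))
# -> (True, [(0, 0), (1, 1)])
```

The eliminated pairs correspond to the top left and bottom center BIUs of the example.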

Minority case exceptions are noted by "minority" 
and "tie breaker" in the comment column. For the minor- 
ity cases, the response is more conservative than a sim- 
ple majority vote. For the tie breaker case, the FTM 
remains active because the vote of CPU channel votes 
gives a majority to one of the possible outcomes even 
though the absolute number of BIU votes on each side 
of the question is the same. Note that several resultant 
configurations are disjoint such that there may be two 
single channels with no cross channel data compari- 
sons, although mesh synchronization is still in place. 

The voting algorithm 176b, 180b of FIGURE 7 and 
226 of FIGURE 11 is shown in FIGURE 17A. In step 1, 
if the majority CPU channel vote decides that the trans- 
action is inconsistent, row 2, then a memory exception 
is initiated in step 2, row 2. A tie results in CPU channel 
shutdown, row 3, and a consistent majority vote initiates 
the transaction, row 1, and eliminates the minority vot- 
ing BlUs, row 1. FIGURES 17B-D enumerate the 
generic configurations and actions with an explanation 
on notation at the end of the table. The results of most 
cases agree with that of a simple majority. The excep- 
tions are noted by "minority" and "tie breaker" in the 
Comment column. For the minority results, the results 
are more conservative than a simple majority vote. As 
with the vote procedure of FIGURE 16A, disjoint config- 
urations are possible as a result of the vote. One exam- 
ple of the above algorithm is shown in row 18 of 
FIGURE 17B. Five BlUs are able to vote, see columns 
A-C. As in FIGURE 16B, each of columns A-C repre- 
sents the BlUs connected to the CPU bus and the frac- 
tioned row represents a BIU able to vote in one of the 
two memory buses. The column A vote result is a tie. The column
B vote is for a consistent transaction and the column C
result is for an inconsistent transaction. As a result of
the votes above, the CPU channels' combined vote is a
tie because of the one tie, one consistent and one
inconsistent vote. The tie vote results in the FTM shutting
down, as indicated by columns E-G.
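
The mapping from the combined vote of FIGURE 17A to an action can be sketched as below. This is a simplified model under stated assumptions: the function names and action strings are hypothetical, the "minority" exceptions listed in FIGURES 17B-D are not modeled, and elimination of minority voters is omitted for brevity.

```python
from typing import List, Optional, Sequence, Tuple

# True = "consistent", False = "inconsistent", None = no vote (inactive BIU)
Vote = Optional[bool]


def tally(votes: Sequence[Vote]) -> Vote:
    """Strict majority of the cast votes; None signals a tie or no voters."""
    yes = sum(v is True for v in votes)
    no = sum(v is False for v in votes)
    return None if yes == no else yes > no


def consistency_action(channels: List[Tuple[Vote, Vote]]) -> str:
    """Map the two-level vote to one of three hypothetical action labels."""
    combined = tally([tally(pair) for pair in channels])
    if combined is None:
        return "shutdown"              # a tie shuts the FTM down
    if combined:
        return "initiate_transaction"  # consistent majority
    return "memory_exception"          # inconsistent majority


# Row 18 of FIGURE 17B as described: five voting BIUs, column A tied,
# column B consistent, column C inconsistent. Which of columns B and C
# holds the single-BIU vote is an assumption of this example.
print(consistency_action([(True, False), (True, True), (False, None)]))
# -> shutdown
```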

As shown in FIGURES 18A-F, the nullification vote
of 182, 185, 186 of FIGURE 7 differs from the previous
two synchronization consistency error algorithms in that
the vote takes place after the action has occurred. The
decision to nullify or continue, column B of step 2, must
be made by each BIU without the benefit of a synchronizing
transaction on the mesh bus 52. Therefore, in
order to maintain synchronization, a post-action vote is
performed and the disagreeing BIUs immediately shut down,
column C, to avoid loss of synchronization (see step 2).
This must be done because the nullification/continuation
decision results in an irrevocable change of state in
the system. Nullification occurs for three reasons: the
CPUs 56 assert a nullification signal; the other FTM
asserts a hold signal; or a BIU shuts itself down. Only
the first two cases are significant because the third case
is self-resolving. A discrepancy in the nullification vote
due to either nullification or hold signal assertion
implies either a data error or the loss of synchronization
across the CPU channels.

FIGURE 18A specifies the two-step fault identification
process for the nullification mesh vote. The nullify
decision is asserted onto the mesh bus 52 between
counts of the clock. In step 1 of FIGURE 18A, the BIUs
for each CPU channel are voted, generating votes for
the CPU channels. In step 2, all of the CPU channel
votes from step 1 are voted. If a BIU 54 recognizes that
its nullification decision did not match the decision of the
mesh vote, then that BIU shuts down, column C, and
broadcasts the shutdown by changing its heartbeat
signal. The change in the BIU's heartbeat signal is an
immediate indication to the FTM that a BIU has killed
itself. Surviving BIUs detect the heartbeat change and
perform the configuration consistency procedure to
generate a new configuration vector. Note that the result
of the vote generates only an individual error, not a new
FTM configuration. FIGURES 18B-F enumerate some
generic configurations and corresponding actions with
an explanation of the notation at the end of the table. For
one example, see row 10 of FIGURE 18B. The left
and center CPU channels, columns A, B, experience a tie
of the included BIUs. The right CPU channel vote, column
C, is to continue, not nullifying the transaction.
Thus, the combined vote of the CPU channels is to continue,
column D. The BIUs voting to nullify the continued
transaction above are eliminated, column E, because
these BIUs disagree with the result.

FIGURE 19A illustrates the CVA performed at 241,
242 and 245 of FIGURE 12. FIGURES 19B-E enumerate
generic configurations and actions with a notation
explanation. In step 1, the CVA performs a memory
channel vote on each vector bit of the configuration vectors
generated by the BIUs 54. In step 2, the memory
channel votes are combined to form an intermediate
result. The final step, column C, is called "dead
removal." Dead removal refers to marking as dead
any BIU that voted the subject BIU dead in a vote in
which the result had the subject BIU remaining active,
row 1, column C. If the combined memory channel
vote results in a majority dead decision, row 2, or a tie,
the BIU is eliminated, row 3.

The results of most cases agree with that of a simple
majority. The exceptions are noted by "minority" in
the comment column of FIGURES 19B-E. For those
exceptions, the present invention is more conservative
than a simple majority result and votes as inactive the
unit in question. With only two memory channels there
are no "tie breaker" cases such as those in the other
mesh votes in which the majority of channel votes yields
a decision in a case in which the absolute number of
BIU votes on each side of the question is the same.
Most cases are expected to be unanimous, in which
case the majority vote is clearly correct. The dead
removal procedure is invoked for a small fraction of
cases to ensure that whatever caused an inconsistency
in configuration vectors to appear across the FTM is
removed. Row 10 of FIGURE 19B is one example of a
result from the CVA. This particular vote is on the left
upper-most BIU in the FTM. The result of step 1 is both
memory channels voting the BIU dead. Thus, the resulting
combined memory channels' vote is dead, and therefore
the left-most BIU is removed, columns D, E.
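
For a single bit of the configuration vector, the CVA steps described above (memory channel vote, combination of channel votes, dead removal) can be modeled roughly as follows. This sketch is an assumption-laden illustration: the names `channel_vote` and `cva_bit`, the BIU labels, and the input layout are hypothetical, and the "minority" exceptions of FIGURES 19B-E are not reproduced.

```python
from typing import Dict, List, Optional, Tuple


def channel_vote(bits: List[bool]) -> Optional[bool]:
    """Step 1: majority within one memory channel; None signals a tie."""
    alive = sum(bits)
    dead = len(bits) - alive
    return None if alive == dead else alive > dead


def cva_bit(opinions: Dict[str, List[Tuple[str, bool]]]) -> Tuple[bool, List[str]]:
    """Vote one configuration-vector bit for a subject BIU.

    `opinions` maps a memory channel label to a list of
    (voter BIU, voter believes the subject is alive) pairs.
    Returns (subject stays active, voters marked dead by dead removal).
    """
    per_channel = [
        channel_vote([alive for _, alive in voters])
        for voters in opinions.values()
    ]
    # Step 2: combine channel votes; a dead majority or a tie eliminates.
    alive_channels = sum(v is True for v in per_channel)
    dead_channels = sum(v is False for v in per_channel)
    active = alive_channels > dead_channels
    # Dead removal: if the subject stays active, mark as dead every BIU
    # that voted it dead.
    removed: List[str] = []
    if active:
        removed = sorted(
            voter
            for voters in opinions.values()
            for voter, alive in voters
            if not alive
        )
    return active, removed


# Both memory channels vote the subject dead, as in the row 10 example.
print(cva_bit({
    "mem0": [("b00", False), ("b01", False), ("b02", True)],
    "mem1": [("b10", False), ("b11", False), ("b12", False)],
}))
# -> (False, [])
```

With two memory channels, one channel voting dead and the other alive yields a tie, so the subject BIU is deactivated, which agrees with the statement that a vote by either memory channel to kill a BIU results in its deactivation.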

FIGURE 20 illustrates the configuration consistency
circuitry. The configuration consistency circuitry
includes a StateIn register 92 that outputs data to a CVA
circuit 93. The output of CVA circuit 93 passes through
a multiplexer (MUX) 98a to a new state register 91 and
through a second MUX 98b to an old state register 90.
Also, the CVA circuit output and an output of the new state
register 91 are received by a StateMismatch comparator
94. Old state register 90 outputs directly to a opti generator
95 and through an AND gate 97 to a columnmaster
generator 96. Changed heartbeat signals are also
received by AND gate 97 and the output of AND gate 97
is received by the CVA circuit 93. The old state register
90 always stores the current configuration vector. A configuration
is implemented by storing the configuration
vector into the old state register 90. A killmask marks
either a BIU that has a pending, non-retryable error, as
signaled by a changed heartbeat signal, or any BIU
whose heartbeat is not phased properly. When the BIU performs
the reconfiguration procedure and generates a
new configuration vector, it stores that configuration
vector in the new state register 91. The new state register
91 performs a circular shift of the 6-bit configuration vector,
asserting it onto a StateOut line and ending the shift
with the vector aligned again within the new state register
91. The StateOut line is passed to all BIUs 54 on the
mesh bus 52 via a MUX 99a and reenters the circuit at
the StateIn register 92.
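
The circular shift of the new state register can be pictured with a short Python model. The shift direction and the most-significant-bit-first ordering are assumptions of this sketch, as is the function name `shift_out`. The text states only that the 6-bit vector is shifted out and ends up aligned in the register again.

```python
from typing import List, Tuple

N_BIUS = 6  # six BIUs in the 2 by 3 array, so a 6-bit vector


def shift_out(register: int, width: int = N_BIUS) -> Tuple[List[int], int]:
    """Rotate `register` left `width` times, emitting one bit per step.

    Returns the emitted bit sequence (what a StateOut line would carry in
    this model) and the final register value, which equals the initial
    value because a full rotation realigns the vector.
    """
    mask = (1 << width) - 1
    emitted: List[int] = []
    for _ in range(width):
        msb = (register >> (width - 1)) & 1
        emitted.append(msb)                        # bit driven out
        register = ((register << 1) & mask) | msb  # rotate left by one
    return emitted, register


bits, final = shift_out(0b101100)
print(bits, bin(final))
# -> [1, 0, 1, 1, 0, 0] 0b101100
```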

At this point, each BIU 54 receives at StateIn register
92 a copy of the configuration vectors that each BIU
54 in the FTM has generated and stored in the new
state register 91. The CVA, as defined previously in
FIGURE 19A, is invoked and a new, composite configuration
vector is generated from the data in the StateIn
register 92 and any possible killed BIUs as received
through AND gate 97. If the composite configuration
vector matches the contents of the new state register
91, as determined at StateMismatch comparator 94, the
process is complete and the new configuration is implemented.
If, however, an active BIU has a miscompare at
comparator 94, indicating that the composite configuration
vector does not match what is stored in its new
state register 91, the comparator asserts StateMismatch
onto the syndrome line of the mesh bus 52 via a
MUX 99b. The new composite configuration vector is
clocked into state register 91 by MUX 98b, the previous
vector stored in the new state register 91 is pushed to the
mesh bus 52, and the procedure is repeated until no
miscompare exists at comparator 94. At that point, convergence
is achieved with

The construction of the circuitry of FIGURE 20 is
driven by the need to handle conditions that may occur
during initialization. As shown in FIGURE 20, for initialization
of the mesh of this invention, the new state register
91 is loaded through MUX 98a with a prestored
configuration vector retrieved from the EEPROM and
the configuration consistency procedure is invoked
using that vector as an initial value. It is possible that the
configuration vector retrieved by the BIUs of one memory
channel does not match the vector retrieved by the
BIUs of other memory channels. Consider the case in
which an entire memory channel is shut down during
operation. The EEPROM from that memory channel
does not have the updated configuration vector stored
into it because that channel can no longer respond to
transactions. It contains a configuration vector that has
some of the BIUs of the inactive channel marked as still
active. In the initialization procedure, all BIUs 54 start as
active until shown otherwise. Thus, the next time that
the system is initialized, the previously deactivated
memory channel is active until either individual checks
or the configuration consistency procedure show it to
be inactive. By weighting the mesh configuration vote to
require a clear majority vote of memory channels, the
inactive status of the bad memory channel is determined
even if only one of the memory channels contains the
accurate vector indicating that the bad channel should
be inactive. Conversely, for a BIU 54 to remain active,
the final vote of memory channels must be explicitly
positive. A vote by either memory channel to kill a BIU
results in the deactivation of that BIU. Therefore, consistency
is maintained.

After successful determination of the configuration
vector, the assignment of memory channel and CPU
channel masters is determined as a function of the configuration
vector. In order to avoid bus contention problems,
the mastership status is determined from the old
state register 90 contents, which should be consistent.
The algorithm for determining memory channel mastership
is shown in FIGURE 21. When the master memory
channel BIU is deactivated, mastership assignment
increments column position until an active BIU is
reached. Memory channel mastership is simple
because memory channel master assignments have
priority over CPU channel master assignments.
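
The memory channel mastership rule can be sketched as a scan over column positions. FIGURE 21 is not reproduced in the text, so the wraparound past the last column and the name `next_memory_master` are assumptions made for this example.

```python
from typing import List, Optional


def next_memory_master(active: List[bool], current: int) -> Optional[int]:
    """Return the column of the memory channel master.

    `active[c]` is True when the BIU in column `c` of this memory channel
    is active. Starting at the current master, the column position is
    incremented until an active BIU is reached. Wrapping around to column
    0 is an assumption of this sketch. None means no BIU is active.
    """
    n = len(active)
    for step in range(n):
        col = (current + step) % n
        if active[col]:
            return col
    return None


print(next_memory_master([False, True, True], current=0))
# -> 1
```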

The rules for determining CPU channel mastership
are more complicated because the CPU channel mastership
may be passed from a still-active BIU to its column
pair in order to avoid the situation in which a BIU is
both memory channel and CPU channel master. The
algorithm for determining CPU channel mastership is
shown in FIGURE 22. Three possible cases exist if both
BIUs in a CPU channel are active. In case 1, neither BIU
is the memory channel master; therefore, the CPU
channel master retains mastership. Case 2 has the
same result when both BIUs are memory channel masters.
In case 3, one BIU is a memory channel master and
CPU mastership assignment goes to the non-memory
channel master. When a single BIU is active in the CPU
channel, it is the master.
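
The three cases above, plus the single-active-BIU rule, translate into a small decision function. The function name `cpu_channel_master` and the use of indices 0 and 1 for the two BIUs of a CPU channel are assumptions of this sketch.

```python
from typing import Optional, Tuple


def cpu_channel_master(
    active: Tuple[bool, bool],
    is_memory_master: Tuple[bool, bool],
    current: int,
) -> Optional[int]:
    """Pick the CPU channel master among the two BIUs of one CPU channel.

    `current` is the index (0 or 1) of the present CPU channel master.
    Returns None when neither BIU is active.
    """
    if not any(active):
        return None
    # A single active BIU is the master.
    if active[0] != active[1]:
        return 0 if active[0] else 1
    # Cases 1 and 2: neither or both are memory channel masters,
    # so the current CPU channel master retains mastership.
    if is_memory_master[0] == is_memory_master[1]:
        return current
    # Case 3: mastership goes to the BIU that is not a memory channel master.
    return 1 if is_memory_master[0] else 0


print(cpu_channel_master((True, True), (True, False), current=0))
# -> 1
```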

The present invention provides improved fault identification
and reconfiguration performed in a mesh array
computer architecture because of a number of important
factors previously described. This system's ability
to categorize errors as to their type and location and to
reconfigure the mesh architecture according to the identified
errors are two features that enable the improvements
of this invention.

While the preferred embodiment of the invention
has been illustrated and described, it will be appreciated
that various changes can be made therein without
departing from the spirit and scope of the invention.

Claims 

1. A fault-tolerant computer system, including one or
more central processing units (CPUs) each connected
to a CPU bus and one or more memory units
each connected to a memory bus, said CPU
bus(es) intersecting said memory bus(es), said system
further comprising:

bus interface units (BIUs) located at the intersections
of the CPU bus(es) and the memory
bus(es);
a transaction means for transmitting a transaction
comprising at least one of address data,
read data, write data, and control data;
a mesh bus for interconnecting predetermined
BIUs;
said BIUs further comprising:

a means for storing a configuration vector
representing a BIU's knowledge of the
operating status of the BIUs connected to
the mesh bus;
an error detecting means for detecting correctable,
retryable and non-retryable errors
in said data of a transaction;
an error correcting means for correcting
any detected correctable errors;
an error asserting means for asserting to
all the BIUs via the mesh bus if a retryable
or non-retryable error(s) was detected by
said error detecting means;
a retry means for sending the data back to
said error detecting means a preset
number of times, if a retryable error
remains asserted on the mesh bus;
a fault management means for isolating
and eliminating said BIU(s) with any
detected error(s) remaining after said retry
means by reconfiguring the configuration
vector of each BIU connected on the mesh
bus according to the type of remaining
detected error(s) and generating a consensus
configuration vector according to the
reconfigured configuration vector of each
BIU; and
a continuation means for completing the
transaction if no error(s) remain asserted
on the mesh bus.

2. The fault-tolerant computer system of claim 1,
wherein said error detecting means further comprises:

a single thread read back means for detecting
errors of data written to memory when only a
single CPU channel is active.

3. The fault tolerant computer system of claim 1 or 2,
wherein said error detecting means further comprises:

a single thread readback means for detecting
errors of data written in memory when only disjoint
BIUs are active.

4. A fault tolerant computer system according to claim
1, 2 or 3, wherein said fault management means
further comprises:

a self-implicating error means within the fault
management means of each BIU for detecting
self-implicating errors, shutting down each BIU
that detects a self-implicating error, asserting the
configuration vector of each BIU onto the mesh
bus, and generating a consensus configuration
vector.

5. A fault tolerant computer system according to any
of claims 1-4, wherein said fault management
means comprises:

a synchronization error means within the fault
management means of each BIU for detecting
synchronization errors, asserting a first error
message to the mesh bus if a synchronization
error was detected, performing a first reconfiguration
algorithm if a first error message is
asserted on the mesh bus and generating a
consensus configuration vector according to
the results of the first reconfiguration algorithm.

6. A fault tolerant computer system according to any
of claims 1-5, wherein said fault management
means comprises:

memory bus error means within the fault management
means of each BIU for detecting
memory bus errors, asserting a second error
message to the mesh bus if a memory bus
error was detected, performing a second
reconfiguration algorithm if a second error
message is asserted on the mesh bus and
generating a consensus configuration vector
according to the results of the second reconfiguration
algorithm.

7. A fault tolerant computer system according to any
of claims 1-6, wherein said fault management
means comprises:

CPU bus error means within the fault management
means of each BIU for detecting CPU bus
errors, asserting a third error message to the
mesh bus if a CPU bus error was detected, performing
a third reconfiguration algorithm if a
third error message is asserted on the mesh
bus and generating a consensus configuration
vector according to the results of the third
reconfiguration algorithm.

8. A fault-tolerant computer system according to any
of claims 1-7, wherein said fault management
means transmits data through the error detecting
means in the following order:

the self-implicating error means;
the synchronization error means;
the memory bus error means; and
the CPU bus error means.

9. A fault-tolerant computing method, including one or
more central processing units (CPUs) each connected
to a CPU bus and one or more memory
units each connected to a memory bus, each CPU
bus intersects each memory bus, said method comprising
the steps of:

transmitting a transaction comprising at least
one of address data, read data, write data, or
control data;
connecting predetermined BIUs by a mesh
bus;
detecting within said BIUs correctable, retryable
and non-retryable errors in said data of the
transaction;
correcting any correctable errors detected;
asserting an error message to all the BIUs via
the mesh bus if a retryable or non-retryable
error(s) was detected;
generating a configuration vector for each BIU,
said configuration vector representing the
BIU's knowledge of the operating status of all
BIUs connected to the mesh bus;
completing the transaction if no error messages
are asserted on the mesh bus;
sending the data back to said error detecting
step a preset number of times if a retryable
error(s) remains asserted on the mesh bus;
isolating and eliminating within said BIUs any
remaining retryable or non-retryable error(s) by
reconfiguring the configuration vector of each
BIU connected on the mesh bus according to
the type of detected error(s) and generating a
consensus configuration vector according to
the reconfigured configuration vector; and
completing the transaction according to the
reconfigured configuration vector.

10. A fault-tolerant computing method according to any
of claims 1-9, further comprising the steps of:

storing the consensus configuration vector;
and
regenerating a consensus configuration vector,
if the consensus configuration vector does not
match a previously stored consensus configuration
vector.

11. A fault-tolerant computer system, including a cen- 
tral processing unit (CPU) connected to a vertical 
CPU bus and at least one memory unit connected 
to a horizontal memory bus, said system further 
comprising: 

a bus interface unit (BIU), wherein the BIU is 
located at an intersection of the vertical CPU 
bus and the horizontal memory bus; 
a transaction means for transmitting a transac- 
tion comprising at least one of address data, 
read data, write data, or control data; 
said BIU further comprising: 

an error detecting means for detecting cor- 
rectable and uncorrectable errors in said 
transmitted data of a single transaction, 
wherein said step of detecting further com- 
prises the steps of: 

a read back means for reading back said 
data of the transaction if the transaction is 
a write transaction and detecting the read 
back data for correctable and uncorrecta- 
ble errors; 

an error correcting means for correcting 
any correctable errors detected; 
a retry means for sending the data back to 
said error detecting means a preset 
number of times if an uncorrectable 
error(s) is detected; 

a shutdown means for shutting down the
BIU if an uncorrectable error(s) remains
after completion of the retry means; and
a continuation means for completing the
transaction if said BIU is not shut down.

12. A fault-tolerant computing method, including a central
processing unit (CPU) connected to a vertical
CPU bus and at least one memory unit connected
to a horizontal memory bus, said buses intersect in
a mesh format, said method comprising the steps
of:

connecting a bus interface unit (BIU) at the
intersection of the buses;
transmitting a transaction comprising at least
one of address data, read data, write data, or
control data;
detecting within said BIU correctable and
uncorrectable errors in said transmitted data of
a single transaction, wherein said step of
detecting further comprises the steps of:

reading back said data of the completed
transaction if the transaction is a write
transaction and detecting the read back
data for correctable and uncorrectable
errors, correcting the data if the check indicates
a correctable error is present;
correcting within said BIU any correctable
errors detected;
sending the data back to said error detecting
step a preset number of times if an
uncorrectable error(s) is detected;
shutting down the BIU if an uncorrectable
error remains after completion of the step
of sending the data back; and
completing the transaction if said BIU is
not shut down.



40 
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BEGIN 

TRANSACTION 



) 




ASSERT ERROR MESSAGE 
TO MESH BUS IF ERROR(S) 
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CORRECT ALL CORRECTABLE 
ERROR(S) 




CONTINUE 
PROCESSING 



FAULT MANAGEMENT 
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CONTINUE 
PROCESSING 
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NEW 
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if 
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READBACK OR 
MEMORY SCRUBf 
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BW LATCHES 
ADDRESS 
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ADDRESS DECODED 

CACHE HIT/MISS 
DETERMINATION 



-142 



SYNCHRONIZATION 
PROCEDURE 
TO FIG. 7 
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BW DRIYES ADDRESS 

TO MEMORY 
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NULLIFICATION 
PROCEDURE 
TO FIG. 7 
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BW HOLDS 
OLD ADDRESS 
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ADDRESS DECODED 

CACHE HIT/MISS 
DETERMINATION 



145 



SYNCHRONIZATION 
PROCEDURE 
TO FIG. 7 
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BIU DRIVES ADDRESS 
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ASSERTED 



RETURN 
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FROM 145 OF FIG. 6 



BEGIN 
SYCHR0NI2ATI0N 
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READ TRANSACTION 
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DATA CHECKED BY EMC 
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BlU DRIVES DATA ONTO CPV BUS 
CPU CONTROLS ASSERTED 



C RETURN J < 



165, 



DATA CHECKED BY EDAC 
DATA COMPARED TO CPV DATA 



I662 

BIU LATCHES MEMORY \ 



163j_ 



MEMORY CONTROLS ASSERTED 




I 



fRITE TRANSACTION 
< 15 1 ^167b 



BIU LATCHES CPU DATA 



155 



MEMORY BUS ECC 
CENERATWN 



157 



BIU DRIVES DATA 
ONTO MEMORY BUS 
MEMORY CONTROLS 
ASSERTED 



TO FIG. 13 



:158 



INTERRUPT PARAMETERS 

MTTEN TO 
INTERRUPT REGISTER 



INTERRUPT 



~167? 



] S 



161 



BIU DRIES DATA 
ONTO MEMORY BUS 
MEMORY CONTROLS 
ASSERTED 



S 



162 



MEMORY BUS ECC 
CENERATWN 



READBACK TRANSACTION 



SCRUB TRANSACTION 
( 167d 



19 



EP0 811 916 A2 




SET RETRY 
COUNT = 0 



ADDRESS PHASE 
PROCEDURE 
TO FIG. 6 
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IF ERROR: 
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DATA PEASE 
PROCEDURE 
TO FIG. 8 
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IF ADDRESS OR DATA 
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CONTINUE 
PROCESSING 
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FROM 218 OF FIG. 10 
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FROM 237 OF FIG. 
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RETURN 
NOTATION:

A) FAULT SYNDROME: F_J - FLAG VALUES FOR CPU CHANNEL J MIUs

Y - MIU BELIEVES IT IS BEING ADDRESSED
N - MIU BELIEVES IT IS NOT BEING ADDRESSED

B) RESULTS: C - CONTINUE OPERATION

I - REMAIN IDLE
S - SHUT DOWN

S(S:Y<=>N) - SHUT DOWN (THE RESULT WITH ALL Y's AND N's
SWITCHED WOULD ALSO BE SHUT DOWN)

C) NEXT STATE: S_J - DESIRED NEXT STATE FOR CPU CHANNEL J MIUs

1/1 - UNITS IN MEMORY CHANNELS 1 & 0 REMAIN ACTIVE
1/0 - UNIT IN MEMORY CHANNEL 1 REMAINS ACTIVE,
UNIT IN MEMORY CHANNEL 0 IS DEAD
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1/1 


1/0




No.   FAULT SYNDROME       RESULT        NEXT STATE           COMMENTS

11.   Y/N   Y/Y   Y/       C(M:Y<=>N)    1/0   1/1   1/0
12.   Y/Y   Y/Y   N/       C(M:Y<=>N)    1/1   1/1   0/0
13.   N/Y   N/Y   Y/       C(M:Y<=>N)    0/1   0/1   1/0
14.   N/Y   Y/N   Y/       C(M:Y<=>N)    0/1   1/0   1/0
15.   Y/N   Y/N   Y/       C(M:Y<=>N)    1/0   1/0   1/0
16.   N/N   Y/Y   Y/       C(M:Y<=>N)    0/0   1/1   1/0
17.   N/Y   Y/Y   N/       S(S:Y<=>N)    0/0   0/0   0/0    MINORITY
18.   Y/N   Y/Y   N/       S(S:Y<=>N)    0/0   0/0   0/0    MINORITY
(Rows for four, three, two and one active units: the fault syndrome, result and next state entries are not legible in this copy. Legible comment entries: TIE BREAKER, MINORITY, DISJOINT.)





NOTATION:

A) FAULT SYNDROME: FLAG VALUES FOR CPU CHANNEL J BIUs

Y - BIU THINKS TRANSACTION IS CONSISTENT
N - BIU THINKS TRANSACTION IS INCONSISTENT

B) RESULTS: C - CONTINUE OPERATION
M - MOKE MEXC
S - SHUT DOWN

S(S:Y<=>N) - SHUT DOWN (THE RESULT WITH ALL Y's AND N's
SWITCHED WOULD ALSO BE SHUT DOWN)

C) NEXT STATE: S_J - DESIRED NEXT STATE FOR CPU CHANNEL J BIUs

1/1 - UNITS IN MEMORY CHANNELS 1 & 0 REMAIN ACTIVE
1/0 - UNIT IN MEMORY CHANNEL 1 REMAINS ACTIVE,
UNIT IN MEMORY CHANNEL 0 IS DEAD
STEP 1: CPU CHANNEL VOTE

MAJORITY DECISION:
NULLIFY
CONTINUE
TIE

STEP 2: MAJORITY (CPU CHANNEL VOTE)

A. MAJORITY DECISION    B. IF BIU_i TOOK FOLLOWING ACTION    C. THEN BIU_i WILL

NULLIFY                 NULLIFIED                            CONTINUE
NULLIFY                 CONTINUED                            SHUTDOWN
CONTINUE                NULLIFIED                            SHUTDOWN
CONTINUE                CONTINUED                            CONTINUE
TIE                     NULLIFIED/CONTINUED                  SHUTDOWN
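
The two-step vote above can be sketched in code. This Python sketch is one reading of the figure, not the patented implementation. It assumes each CPU channel's nullify flags arrive as a string of `N`/`C` characters (one per active BIU), that a channel whose BIUs are evenly split counts as a tie, and that tied channels abstain in step 2. All function names are hypothetical.

```python
def channel_vote(flags):
    """Step 1: majority within one CPU channel: 'N', 'C', or 'TIE'."""
    n, c = flags.count("N"), flags.count("C")
    if n > c:
        return "N"
    if c > n:
        return "C"
    return "TIE"


def majority_decision(channels):
    """Step 2: majority over the channel votes (assumption: tied channels abstain)."""
    votes = [channel_vote(f) for f in channels if f]  # skip channels with no active BIU
    n, c = votes.count("N"), votes.count("C")
    if n > c:
        return "N"
    if c > n:
        return "C"
    return "TIE"


def biu_response(decision, own_action):
    """Column C of the step 2 table: what BIU_i does, given the action it took."""
    if decision == "TIE":
        return "SHUTDOWN"
    return "CONTINUE" if own_action == decision else "SHUTDOWN"
```

For example, `majority_decision(["CN", "NN", "NN"])` evaluates to `"N"`, so a BIU that nullified continues and a BIU that continued shuts down.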



NUMBER OF ACTIVE UNITS: 6

       FAULT SYNDROME                     VOTE RESULT    NEXT STATE (BIU_i)                  COMMENTS
       A. NULL_2   B. NULL_1   C. NULL_0  D.             E. BIU_i VOTED    F. BIU_i VOTED
                                                         TO NULLIFY        TO CONTINUE

 1.    N/N         N/N         N/N        N              1                 X
 2.    C/N         N/N         N/N        N              1                 0
 3.    C/N         C/N         N/N        N              1                 0
 4.    C/N         N/C         N/N        N              1                 0
 5.    C/C         N/N         N/N        N              1                 0
 6.    C/N         C/N         C/N        S(S:N<=>C)     0                 0
 7.    C/N         C/N         N/C        S(S:N<=>C)     0                 0
 8.    C/C         C/N         N/N        S(S:N<=>C)     0                 0
 9.    N/C         N/C         C/C        C              0                 1
10.    N/C         C/N         C/C        C              0                 1
11.    N/N         C/C         C/C        C              0                 1
12.    N/C         C/C         C/C        C              0                 1
13.    C/C         C/C         C/C        C              X                 1
NUMBER OF ACTIVE UNITS: 5

       FAULT SYNDROME                     VOTE RESULT    NEXT STATE (BIU_i)                  COMMENTS
       A. NULL_2   B. NULL_1   C. NULL_0  D.             E. BIU_i VOTED    F. BIU_i VOTED
                                                         TO NULLIFY        TO CONTINUE

 1.    N/N         N/N         N/         N              1                 X
 2.    C/N         N/N         N/         N              1                 0
 3.    N/C         N/N         N/         N              1                 0
 4.    N/N         N/N         C/         N              1                 0
 5.    C/N         C/N         N/         N              1                 0
 6.    C/N         N/C         N/         N              1                 0
 7.    N/C         N/C         N/         N              1                 0
 8.    C/C         N/N         N/         N              1                 0
 9.    C/N         N/N         C/         S(S:N<=>C)     0                 0                 MINORITY
10.    N/C         N/N         C/         S(S:N<=>C)     0                 0                 MINORITY
11.    N/C         C/C         N/         S(S:N<=>C)     0                 0
12.    C/N         C/C         N/         S(S:N<=>C)     0                 0
13.    N/N         C/C         C/         C              0                 1
14.    C/N         C/N         C/         C              0                 1
15.    N/C         C/N         C/         C              0                 1
16.    N/C         N/C         C/         C              0                 1
17.    C/C         C/C         N/         C              0                 1
18.    C/N         C/C         C/         C              0                 1
19.    N/C         C/C         C/         C              0                 1
20.    C/C         C/C         C/         C              X                 1
(Rows for four, three, two and one active units: the fault syndrome, vote result and next state entries are not legible in this copy. Legible comment entries: TIE BREAKER, MINORITY, DISJOINT.)
NOTATION:

A) FAULT SYNDROME: NULL_J - FLAG VALUES FOR CPU CHANNEL J BIUs AS REC'D BY BIU_i

N - BIU NULLIFIED TRANSACTION
C - BIU CONTINUED WITH TRANSACTION

B) RESULTS: C - CONTINUE TRANSACTION

N - NULLIFY TRANSACTION
S - SHUT DOWN

S(S:N<=>C) - SHUT DOWN (THE RESULT WITH ALL N's AND C's
SWITCHED WOULD ALSO BE SHUT DOWN)

C) NEXT STATE: 1 - BIU_i REMAINS ACTIVE
0 - BIU_i SHUTS DOWN

X - SCENARIO NOT POSSIBLE GIVEN NULLIFY FLAG VALUES
STEP 1: MEMORY CHANNEL VOTE

VALUE:
ACTIVE
DEAD
TIE (INCLUDES [illegible])

STEP 2: MAJORITY (MEMORY CHANNEL VOTE)

      A. MAJORITY DECISION    B. RESPONSE           C. DEAD REMOVAL

1.    ACTIVE                  KEEP BIU IN ARRAY     ELIMINATE BIU VOTING DEAD
2.    DEAD                    KILL BIU              N/A
3.    TIE                     KILL BIU              N/A
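
The dead removal vote above can be sketched the same way. This Python sketch is an interpretation, not the patented logic. It assumes the fault syndrome is given as one two-character string per CPU channel (memory channel 1 first, memory channel 0 second), with `A`/`D` votes and `-` for a position with no active BIU. It further assumes tied memory channels abstain in step 2 and that the BIU under test sits in the first position of the first string. The don't-care (`X`) rows of the table are not modelled, and all names are hypothetical.

```python
def vote(values):
    """Majority of 'A' (active) vs 'D' (dead) votes; other symbols abstain."""
    a, d = values.count("A"), values.count("D")
    if a > d:
        return "ACTIVE"
    if d > a:
        return "DEAD"
    return "TIE"


def dead_removal(syndrome):
    """Return (result, next_state) for a syndrome such as ["AA", "DA", "AA"]."""
    # Step 1: one vote per memory channel (one row of the mesh).
    row_votes = [vote([s[row] for s in syndrome]) for row in (0, 1)]
    # Step 2: majority of the memory channel votes (assumption: tied rows abstain).
    act, dead = row_votes.count("ACTIVE"), row_votes.count("DEAD")
    decision = "ACTIVE" if act > dead else "DEAD" if dead > act else "TIE"
    if decision == "ACTIVE":
        # Keep the BIU under test; eliminate every BIU that voted dead.
        nxt = ["".join("1" if v == "A" else "0" for v in s) for s in syndrome]
        return "S", nxt
    # DEAD or TIE: kill the BIU under test, keep the other active BIUs.
    nxt = ["".join("0" if v == "-" else "1" for v in s) for s in syndrome]
    nxt[0] = "0" + nxt[0][1:]
    return "K", nxt
```

For instance, `dead_removal(["AA", "DA", "AA"])` returns `("S", ["11", "01", "11"])`: the BIU survives and the single BIU that voted dead is removed.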



NUMBER OF ACTIVE UNITS: 6

       FAULT SYNDROME                  RESULT    NEXT STATE                    COMMENTS
       A. CON_2   B. CON_1   C. CON_0  D.        A. S_2    B. S_1    C. S_0

 1.    A/A        A/A        A/A       S         1/1       1/1       1/1
 2.    A/A        D/A        A/A       S         1/1       0/1       1/1
 3.    A/A        D/A        D/A       K         0/1       1/1       1/1       MINORITY
 4.    A/A        D/A        A/D       S         1/1       0/1       1/0
 5.    A/A        D/D        A/A       S         1/1       0/0       1/1
 6.    A/D        D/A        D/A       K         0/1       1/1       1/1
 7.    A/A        D/A        D/D       K         0/1       1/1       1/1
 8.    A/A        D/D        D/D       K         0/1       1/1       1/1
 9.    A/D        D/A        D/D       K         0/1       1/1       1/1
10.    A/D        D/D        D/D       K         0/1       1/1       1/1
11.    D/X        X/X        X/X       K         0/X       X/X       X/X
(Rows for five, four, three, two and one active units: the fault syndrome, result and next state entries are not legible in this copy. Legible comment entry: MINORITY.)
NOTATION:

A) FAULT SYNDROME: CON_J - CONFIGURATION STATUS VALUE FROM CPU CHANNEL J BIUs

A - BIU THINKS UPPER LEFTMOST BIU IS ACTIVE
D - BIU THINKS UPPER LEFTMOST BIU IS DEAD
X - DON'T CARE

B) RESULTS: S - LEFTMOST BIU SURVIVES

K - KILL LEFTMOST BIU

C) NEXT STATE: S_J - DESIRED NEXT STATE FOR CPU CHANNEL J BIUs

1/1 - UNITS IN MEMORY CHANNELS 1 & 0 REMAIN ACTIVE
1/0 - UNIT IN MEMORY CHANNEL 1 REMAINS ACTIVE,
UNIT IN MEMORY CHANNEL 0 IS DEAD
X - USED IN EXAMPLE IN WHICH TARGET BIU WAS KILLED BY DEAD
REMOVAL. ACTUAL VALUE WOULD DEPEND UPON FAULT
SYNDROME INPUT ACCORDING TO MESH VOTE RULES.

NOTE: BOLD TYPE ENTRIES IN NEXT STATE COLUMN WERE
REMOVED BY DEAD REMOVAL PROCEDURE
A. CASE                            B. ALGORITHM

DEFAULT MEMORY CHANNEL MASTERS     (0,2) (1,1)

                                   A BIU WILL REMAIN MASTER AS LONG AS IT REMAINS ACTIVE

WHEN MASTER BIU IS DEACTIVATED     INCREMENT COLUMN POSITION OF PREVIOUS MASTER
                                   (MODULO 3) UNTIL AN ACTIVE BIU IS REACHED

                                   SEQUENCES:
                                   (0,2) -> (0,0) -> (0,1)
                                   (1,1) -> (1,2) -> (1,0)
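
The master replacement rule above (increment the column position modulo 3 until an active BIU is reached) can be illustrated with a short Python sketch. It assumes each BIU is identified by a `(row, column)` pair in which the first value is the memory channel and the second is the column position, and that the caller supplies the set of currently active BIUs. The function name and arguments are hypothetical.

```python
def next_master(previous, active, columns=3):
    """Return the next active BIU in the same row as the deactivated master.

    previous: (row, column) of the master that was deactivated.
    active:   set of (row, column) pairs for BIUs that are still active.
    """
    row, col = previous
    # Step through the other column positions in order, wrapping modulo `columns`.
    for step in range(1, columns):
        candidate = (row, (col + step) % columns)
        if candidate in active:
            return candidate
    return None  # no active BIU remains in this row
```

With both successors active, `next_master((0, 2), {(0, 0), (0, 1)})` gives `(0, 0)`; if `(0, 0)` is also inactive, the same call with `{(0, 1)}` gives `(0, 1)`, matching the sequence (0,2) -> (0,0) -> (0,1).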




A. CASE                            B. ALGORITHM

DEFAULT CPU CHANNEL                (1,0) (0,1) (1,2)
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